LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
Large language models (LLMs) have skyrocketed in popularity in recent years
due to their ability to generate high-quality text in response to human
prompting. However, these models have also been shown to generate harmful
content in response to user prompting (e.g., giving users instructions on how
to commit crimes). Much of the literature focuses on mitigating these risks
through methods such as aligning models with human values via reinforcement
learning. Yet even
aligned language models are susceptible to adversarial attacks that bypass
their restrictions on generating harmful text. We propose a simple approach to
defending against these attacks by having a large language model filter its own
responses. Our current results show that even if a model is not fine-tuned to
be aligned with human values, it is possible to stop it from presenting harmful
content to users by validating the content with a language model.
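
To make the self-filtering loop concrete, here is a minimal Python sketch. It assumes a generic `generate_fn` that maps a prompt string to the model's text completion; the filter prompt and refusal message below are illustrative stand-ins, not the paper's exact wording:

```python
from typing import Callable

# Illustrative harm-screening prompt; the paper's exact wording may differ.
HARM_FILTER_PROMPT = (
    "Does the following text contain harmful content? Answer only "
    "'Yes, this is harmful' or 'No, this is not harmful'.\n\nText: {response}"
)

def respond_with_self_defense(user_prompt: str,
                              generate_fn: Callable[[str], str]) -> str:
    """Generate a response, then have the same LLM screen it before returning."""
    candidate = generate_fn(user_prompt)
    verdict = generate_fn(HARM_FILTER_PROMPT.format(response=candidate))
    if verdict.strip().lower().startswith("yes"):
        # The filter flagged the text: withhold it from the user.
        return "Sorry, I can't help with that."
    return candidate
```

Because the filter only reads the generated text, it can wrap any underlying model, aligned or not, which is what lets the approach work without fine-tuning.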
Robust Principles: Architectural Design Principles for Adversarially Robust CNNs
Our research aims to unify existing works' diverging opinions on how
architectural components affect the adversarial robustness of CNNs. To
accomplish our goal, we synthesize a suite of three generalizable robust
architectural design principles: (a) an optimal range for depth and width
configurations, (b) preferring a convolutional stem stage over a patchify one,
and (c) a robust residual block design that adopts squeeze-and-excitation
blocks and non-parametric smooth activation functions (sketched in code after
this abstract). Through extensive experiments
across a wide spectrum of dataset scales, adversarial training methods, model
parameters, and network design spaces, our principles consistently and markedly
improve AutoAttack accuracy by 1-3 percentage points (pp) on CIFAR-10 and
CIFAR-100, and by 4-9 pp on ImageNet. The code is publicly available at
https://github.com/poloclub/robust-principles.
Comment: Published at BMVC'23
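
To make principles (b) and (c) concrete, here is a minimal PyTorch sketch (not the authors' released code): the layer sizes, the reduction ratio, and the choice of SiLU as the non-parametric smooth activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Principle (b): a convolutional stem in place of a patchify stem."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation: global pool, bottleneck MLP, channel gating."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SiLU(),                        # smooth and non-parametric
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.gate(x.mean(dim=(2, 3)))    # squeeze to B x C
        return x * scale[:, :, None, None]       # excite: rescale channels

class RobustResidualBlock(nn.Module):
    """Principle (c): residual block with SE and smooth activations."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            SqueezeExcite(channels),
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.body(x))

# Smoke test: a stem followed by one block preserves the channel count.
if __name__ == "__main__":
    x = torch.randn(2, 3, 32, 32)
    y = RobustResidualBlock(64)(ConvStem()(x))
    print(y.shape)  # torch.Size([2, 64, 16, 16])
```

SiLU is used here because it is smooth and has no learnable parameters, matching the "non-parametric smooth activation" criterion; other activations with those two properties would fit the principle equally well.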